Splitting sentences in C# using Stanford.NLP

October 20, 2014, 11:17 pm by Rhyous

I need to break some text up into sentences. I have a pretty cool regex that does this; however, I want to try out Stanford.NLP for it. Let’s check it out.

Create a Visual Studio C# project.
I chose a New Console Project and named it SentenceSplitter.
Right-click on the project and choose “Manage NuGet Packages.”
Add the Stanford.NLP.CoreNLP NuGet package.

Add the following code to Program.cs. (This is a variation of the code provided here: http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html)

using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using java.util;
using System;
using System.IO;
using Console = System.Console;

namespace SentenceSplitter
{
    class Program
    {
        static void Main(string[] args)
        {
            // Path to the folder with models extracted from `stanford-corenlp-3.4-models.jar`
            var jarRoot = @"stanford-corenlp-3.4-models\";

            const string text = "I went or a run. Then I went to work. I had a good lunch meeting with a friend name John Jr. The commute home was pretty good.";

            // Annotation pipeline configuration
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.setProperty("sutime.binders", "0");

            // Change the current directory so that StanfordCoreNLP can find all the model files automatically
            var curDir = Environment.CurrentDirectory;
            Directory.SetCurrentDirectory(jarRoot);
            var pipeline = new StanfordCoreNLP(props);
            Directory.SetCurrentDirectory(curDir);

            // Annotation
            var annotation = new Annotation(text);
            pipeline.annotate(annotation);

            // These are all the sentences in this document.
            // Each sentence is a CoreMap, which is essentially a Map that uses class objects as keys and has values with custom types.
            // Cast once here so a missing or unexpected result is caught by the null check below.
            var sentences = annotation.get(typeof(CoreAnnotations.SentencesAnnotation)) as ArrayList;
            if (sentences == null)
            {
                return;
            }
            foreach (Annotation sentence in sentences)
            {
                Console.WriteLine(sentence);
            }
        }
    }
}

Warning! If you try to run the program at this point, you will get the following exception: Unrecoverable error while loading a tagger model

java.lang.RuntimeException was unhandled
  HResult=-2146233088
  Message=edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
  Source=stanford-corenlp-3.4
  StackTrace:
       at edu.stanford.nlp.pipeline.StanfordCoreNLP.4.create()
       at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
       at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(Properties A_1, Boolean A_2)
       at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
       at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props)
       at SentenceSplitter.Program.Main(String[] args) in c:\Users\jbarneck\Documents\Projects\NLP\SentenceSplitter\SentenceSplitter\Program.cs:line 20
       at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException: edu.stanford.nlp.io.RuntimeIOException
       HResult=-2146233088
       Message=Unrecoverable error while loading a tagger model
       Source=stanford-corenlp-3.4
       StackTrace:
            at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
            at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile, Properties config, Boolean printLoading)
            at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile)
            at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(String A_0, Boolean A_1)
            at edu.stanford.nlp.pipeline.POSTaggerAnnotator..ctor(String annotatorName, Properties props)
            at edu.stanford.nlp.pipeline.StanfordCoreNLP.4.create()
       InnerException: java.io.IOException
            HResult=-2146233088
            Message=Unable to resolve "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
            Source=stanford-corenlp-3.4
            StackTrace:
                 at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(String textFileOrUrl)
                 at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
            InnerException:

Download the stanford-corenlp-full-3.4.x.zip file from here: http://nlp.stanford.edu/software/corenlp.shtml#Download
Extract the stanford-corenlp-full-2014-6-16.x.zip.
Note: Over time, as new versions come out, make sure the version you download matches the version of your NuGet package.
Extract the stanford-corenlp-3.4-models.jar file to stanford-corenlp-3.4-models.
I used 7zip to extract the jar file.
Copy the stanford-corenlp-3.4-models folder to your Visual Studio project files.
Note: This is one way to include the extracted jar contents in your project. Other options include a copy action or an app.config appSetting that points to the folder. I chose this way because it makes all the files part of the project for this demo. I would probably use the app.config method in production.
In Visual Studio, use Ctrl + left-click to select the stanford-corenlp-3.4-models folder and all of its subfolders.
Open Properties (press F4) and change the Namespace Provider setting to False.
In Visual Studio, use Ctrl + left-click to select the files under the stanford-corenlp-3.4-models folder and all files in all subfolders.
Open Properties (press F4) and change the Build Action to Content and the Copy to Output Directory setting to Copy if newer.
Run the code.
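
The app.config option mentioned in the note above could look roughly like the following. This is a minimal sketch, not the code used in this post: the key name StanfordModelsPath, the ModelPath class, and the example path in the comment are all hypothetical, and the project needs a reference to the System.Configuration assembly.

```csharp
using System;
using System.Configuration; // requires a reference to the System.Configuration assembly
using System.IO;

namespace SentenceSplitter
{
    static class ModelPath
    {
        // Hypothetical key name. app.config would contain something like:
        //   <configuration>
        //     <appSettings>
        //       <add key="StanfordModelsPath" value="C:\nlp\stanford-corenlp-3.4-models" />
        //     </appSettings>
        //   </configuration>
        const string SettingKey = "StanfordModelsPath";

        // Returns the configured models folder, or falls back to the
        // stanford-corenlp-3.4-models folder next to the executable.
        public static string Resolve()
        {
            var configured = ConfigurationManager.AppSettings[SettingKey];
            var path = string.IsNullOrWhiteSpace(configured)
                ? Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "stanford-corenlp-3.4-models")
                : configured;

            if (!Directory.Exists(path))
            {
                throw new DirectoryNotFoundException("Models folder not found: " + path);
            }
            return path;
        }
    }
}
```

With something like this in place, the line that assigns jarRoot in Program.cs could become var jarRoot = ModelPath.Resolve();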

Note: At first I tried to just load the model file. That doesn’t work; I got an exception. I had to set jarRoot as shown above and copy all the contents of the jar file.
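
The model that fails to load in the exception above belongs to the pos annotator. If only sentence boundaries are needed, a pipeline limited to the tokenize and ssplit annotators may be enough. The sketch below is a variation of the Program.cs above under that assumption. Treat it as untested against the 3.4 package: it assumes those two annotators load no model files, which is why the directory switch is dropped.

```csharp
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using java.util;
using Console = System.Console;

namespace SentenceSplitter
{
    class Program
    {
        static void Main(string[] args)
        {
            const string text = "Exit Room A. Turn right. Go down the hall to the first door. Enter Room B.";

            // Only the annotators that sentence splitting depends on.
            // Assumption: neither of them needs the extracted models folder.
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit");
            var pipeline = new StanfordCoreNLP(props);

            var annotation = new Annotation(text);
            pipeline.annotate(annotation);

            // Same retrieval as in the full pipeline: one entry per detected sentence.
            var sentences = annotation.get(typeof(CoreAnnotations.SentencesAnnotation)) as ArrayList;
            if (sentences == null)
            {
                return;
            }
            foreach (Annotation sentence in sentences)
            {
                Console.WriteLine(sentence);
            }
        }
    }
}
```

Since ssplit works on the tokenizer output, the sentence boundaries should match those of the full pipeline, but the pipeline should start faster because the parser and coreference models are never loaded.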

Results

Notice that I threw it a curveball by ending a sentence with Jr. It still figured it out.

I went or a run. Then I went to work. I had a good lunch meeting with a friend name John Jr. The commute home was pretty good.

However, I just tried this paragraph and it did NOT detect the break after the first sentence.

Exit Room A. Turn right. Go down the hall to the first door. Enter Room B.

I am pretty sure this second failure happens because the text looks like a legitimate first name, middle initial, and last name.

Jared A. Barneck
Room A. Turn

Now the question is, how do I train it to not make such mistakes?
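
Short of retraining, one possible workaround is to post-process the sentences that come back and re-split them where a known noun is followed by a single-letter label. The sketch below is an illustration under stated assumptions, not something Stanford.NLP provides: the LabelSplitter class and its word list (Room, Building, Section) are hypothetical, and it uses only System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

static class LabelSplitter
{
    // Matches the whitespace that follows "<noun> <single capital letter>."
    // when the next word starts with a capital letter.
    // The noun list is a hypothetical starting point, not an exhaustive one.
    static readonly Regex LabelBoundary = new Regex(
        @"(?<=\b(?:Room|Building|Section)\s[A-Z]\.)\s+(?=[A-Z])",
        RegexOptions.Compiled);

    // Splits a sentence at every label boundary; returns it unchanged if none is found.
    public static string[] Resplit(string sentence)
    {
        return LabelBoundary.Split(sentence);
    }
}

static class LabelSplitterDemo
{
    static void Main()
    {
        // Split into "Exit Room A." and "Turn right."
        foreach (var part in LabelSplitter.Resplit("Exit Room A. Turn right."))
        {
            Console.WriteLine(part);
        }

        // A name with a middle initial is left alone because no listed noun precedes the initial.
        foreach (var part in LabelSplitter.Resplit("Jared A. Barneck went home."))
        {
            Console.WriteLine(part);
        }
    }
}
```

The trade-off is the one raised in the comments: a word list like this only covers the cases someone thought to add, so it works best as a targeted patch for text with a known vocabulary.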

Category: csharp, NLP

8 Comments

khosro says:

December 21, 2023 at 7:13 pm

HI, I GET THIS ERROR IN SAMPLE CODE, PLEASE HELP ME

edu.stanford.nlp.io.RuntimeIOException
HResult=0x80131500
Message=Error while loading a tagger model (probably missing model file)
Source=stanford-corenlp-4.5.0
StackTrace:
at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile, Properties config, Boolean printLoading)
at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(String , Boolean )
at edu.stanford.nlp.pipeline.POSTaggerAnnotator..ctor(String annotatorName, Properties props)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.posTagger(Properties properties)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$6(Properties , AnnotatorImplementations )
at edu.stanford.nlp.pipeline.StanfordCoreNLP.__Anon7.apply(Object , Object )
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$33(Entry , Properties , AnnotatorImplementations )
at edu.stanford.nlp.pipeline.StanfordCoreNLP.__Anon41.get()
at edu.stanford.nlp.util.Lazy.3.compute()
at edu.stanford.nlp.util.Lazy.get()
at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements, AnnotatorPool annotatorPool)
at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props)
at stanfordparsers.Program.Main(String[] args) in C:\Users\mohammadhossein\source\repos\stanfordparsers\Program.cs:line 25

This exception was originally thrown at this call stack:
[External Code]

Inner Exception 1:
IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger" as class path, filename or URL

Muhammad Javed says:

August 20, 2017 at 10:46 pm

I followed the steps mentioned above to develop a POS tagger in C#, but every time I am getting the following error:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Vincent Mitchell says:

October 22, 2014 at 5:12 pm

Also common abbreviations in addresses like Rd. St. Ln. There are probably at least several tens of those.

Vincent Mitchell says:

October 21, 2014 at 9:39 am

Answer: a rigid input structure requires two spaces after a sentence and one after everything else. If your input is structured that way, the parser can distinguish the two. Other end-of-sentence markers, such as a line break, would work as well.

- Rhyous says:
  
  October 21, 2014 at 9:53 pm
  
  For the Stanford NLP at least, two spaces after did not change the result.
  
  - Vincent Mitchell says:
    
    October 22, 2014 at 5:10 pm
    
    It is probably treating any run of white space as one space. So you could put a result set in there that it checks against. If the period follows a space and one letter, it's likely not a sentence ender. That would cover initials. Also add a case-insensitive exclusion for strings like 'Jr.', 'Sr.', 'Mrs.', etc. Don't forget state abbreviation lookups if you want to exclude those. Bottom line, you'll probably always miss some oddball scenario where one of the period-terminated non-sentence parts of our lovely language is not perceived correctly.
    